3.6. Stemming
You are almost ready to start indexing your literary classics.
But first, you must add some stemming code. As you learned earlier,
any nontrivial amount of text contains several variants of the same
words: plural forms, different tenses, and so on. Users will want to
search for any of these forms. (Imagine not finding a result on Google
because you mistakenly searched for a plural.)
To work around this issue, you store and search the “stemmed”
form of every word. Every word is converted into a root form and,
since all variants share the same root, finding the root will satisfy
searches for all variants. For example, the root for
Connect, Connection,
Connected, Connecting, and
Connections is the same:
Connect. By storing only the root stemmed term,
users can search successfully for any of the variants.
There are several algorithms to do just this. Let’s use the
Porter Stemming Algorithm. Specifically, let’s use the C#
implementation of this algorithm, available at http://tartarus.org/~martin/PorterStemmer/csharp2.txt.
Note: The implementation of this algorithm is based on Martin
Porter’s 1980 paper, “An algorithm for suffix stripping.” You can
find the original paper at http://tartarus.org/~martin/PorterStemmer/def.txt.
This file requires a little modification before you can use it
in this project. Remove the following code from the top and add it to
the project as Stemmer.cs:
using System.Windows.Forms;
[assembly: AssemblyTitle("")]
[assembly: AssemblyDescription("Porter stemmer in CSharp")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("")]
[assembly: AssemblyCopyright("")]
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")]
[assembly: AssemblyVersion("1.4")]
[assembly: AssemblyKeyFile("keyfile.snk")]
[assembly: AssemblyDelaySign(false)]
[assembly: AssemblyKeyName("")]
Note: You don’t need the attributes, since Visual Studio already
specifies them in AssemblyInfo.cs automatically for you. The
WinForms reference is also
removed, since this is a command-line interface (CLI)
application.
The key method in the file is stemTerm. You’ll be calling this to get the
stemmed version of every word you index.
3.7. Indexing
You are now ready to index your content. To get the files,
you asked the user in the Main
method to enter index followed
by the directory path containing the files to index. Let’s add some
code to construct a Document object
out of every file in the directory, and pass it off to an Indexer class for indexing. You give every
document a unique GUID to ensure that it has a unique identifier in
the document table. (Remember that the document ID is also used as the
partition key.)
Add the following code in Program.cs in the Program class, and replace the blank
Index method with this implementation:
static void Index(string path)
{
foreach (var file in Directory.GetFiles(path))
{
string title = Path.GetFileName(file);
string content = File.ReadAllText(file);
Indexer.AddDocument(
new Document( title, Guid.NewGuid().ToString()), content);
}
}
The core indexing magic will happen in the Indexer class, which you’ll see in a
moment.
The algorithm for indexing a document is as follows:
You add the Document
object to the Document table in
Azure storage by adding it to an instance of FTSDataServiceContext.
You strip out all punctuation, carriage returns, line feeds,
and undesirable characters in the document content. You use a
simple regular expression that looks for the ASCII ranges for
punctuation and special characters, as well as .NET’s regular
expression support, to do this.
You split the content into words. Since you removed all
other special characters in the previous step, you can easily
word-break on spaces.
You construct a .NET Dictionary object, and add every new
stemmed term you see to it. You then loop over all words and, when
you come across a new word, check whether it is already present in
the dictionary. If it isn’t present already, you add it. At the
end of this process, you have a list of unique terms in the
content. To stem a term, you call stemTerm in the PorterStemmer class.
You loop through every term collected in step 4 and
construct a row for it in the inverted index table, mapping that
term to the current document’s ID. You save changes back to the
cloud by calling FTSDataServiceContext.SaveChangesWithRetries.
The following code does all of this. It follows the same set of
steps as the algorithm specified. Add it as Indexer.cs:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Configuration;
using PorterStemmerAlgorithm;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure;
namespace FTS
{
public static class Indexer
{
public static void AddDocument(Document doc, string docContent)
{
var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
["DataConnectionString"]);
var ctx =
new FTSDataServiceContext(account.TableEndpoint.ToString(),
account.Credentials);
//Add document to documents table
ctx.AddObject(FTSDataServiceContext.DocumentTableName, doc);
//Now let's find all the terms
var terms = new Dictionary<string,bool>();
//Normalize document - strip punctuation
var content = Regex.Replace(docContent, @"[!-\/:-@\[-\`]", " ");
//Strip line feeds and carriage returns
content = content.Replace('\r', ' ');
content = content.Replace('\n', ' ');
var possibleTerms = content.Split(' ');
foreach (var possibleTerm in possibleTerms)
{
if (possibleTerm.Trim().Length == 0)
{
continue;
}
var stemmer = new PorterStemmer(); //Cannot reuse
var stemmedTerm = stemmer.stemTerm(possibleTerm.Trim().ToLower());
terms[stemmedTerm] = true;
}
//Loop through terms and add pointer to document
foreach (var term in terms.Keys)
{
ctx.AddObject(FTSDataServiceContext.IndexTableName,
new IndexEntry(term, doc.ID));
ctx.SaveChangesWithRetries();
}
}
}
}
And that’s all the indexing code you need! At this point, run
the project.
In the console that pops up type in index followed by the path to your text
files, and press Enter. The program should churn away for a while as
it creates your inverted indexes, as shown in Figure 2. The actual time taken depends on the number of
files and the size of each file. It should be on the order of several
minutes, so treat yourself to a coffee. At the end, you will have a
fully constructed set of inverted indexes in the cloud, mapping
stemmed terms to the documents in which they appear.
3.8. Searching for a single term
Now that you have a fully constructed index, let’s write
some searching code.
In Program.cs, replace the
blank Search method with the
following code to pass off search queries to the helper Searcher class. This helper class will
return a List<Document> that you iterate
over and display as the results:
static void Search(string query)
{
var results = Searcher.Search(query);
if (results.Count == 0)
{
Console.WriteLine("No results found!");
return;
}
foreach (var doc in results)
{
Console.WriteLine(doc.Title);
}
}
Compared to the indexing code, the search code is much simpler.
The first pass at your search code will support searching for only one
term. Copy the following code into Searcher.cs and add it to the
project:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Configuration;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.Samples.ServiceHosting.StorageClient;
using PorterStemmerAlgorithm;
using System.Threading;
namespace FTS
{
public class Searcher
{
public static List<Document> Search(string queryTerms)
{
var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
["DataConnectionString"]);
var ctx = new FTSDataServiceContext(
account.TableEndpoint.ToString(), account.Credentials);
//Clean up query terms
queryTerms = queryTerms.Trim().ToLower();
var stemmer = new PorterStemmer();
// At this time, we support only one term in the query. Let's stem it
var stemmedTerm = stemmer.stemTerm(queryTerms);
// Look up all inverted index entries for the stemmed term
var indexQuery = from indexEntry in
ctx.CreateQuery<IndexEntry>(FTSDataServiceContext.IndexTableName)
where indexEntry.Term == stemmedTerm
select indexEntry;
var results = new List<Document>();
// For every inverted index entry in which the stemmed
// term exists, look up the Document
// and add to our list of results
foreach (IndexEntry indexEntry in indexQuery)
{
var docQuery = from doc in
ctx.CreateQuery<Document>
(FTSDataServiceContext.DocumentTableName)
where doc.ID == indexEntry.DocID
select doc;
results.Add(docQuery.FirstOrDefault());
}
// Return our list of results
return results;
} }
}
The code first creates an FTSDataServiceContext with which to perform
queries later. At this time, let’s deal only with single search
terms—you’ll add support for dealing with multiple search terms soon.
Since you have only a single search term, you can directly case-fold
into lowercase and stem it.
You then query the inverted index table using the stemmed term
and look up all entries that have the stemmed term. This returns a
list of document IDs (if the search was successful). Since displaying
document IDs in the results is not useful, you look up the Document entity for each of these document
IDs so that you can display the title of each of these results.
At this point, you have a fully functional FTS engine that is
good to go. Press F5 and run your project. Search for any term that
appears in the text you’ve indexed by typing search followed by the term. Since a few
classic literary works have been used in this sample, the following
sample queries yield good results.
First, let’s try a search that should yield only one
result:
>search Dracula
Dracula - Bram Stoker
Now, let’s try a common word across many books that should yield
several results:
>search war
Count of Monte Cristo - Alexandre Dumas.txt
Art of War - Sun Tzu.txt
Dracula - Bram Stoker
Fairy Tales - The Brothers Grimm.txt
Around the World in Eighty Days - Jules Verne
On the Origin of Species - Charles Darwin
As you can see, basic search works well. But the key feature in
an FTS engine is the ability to deal with spelling variations.
Let’s try various versions of the last search. Try searching for
wars or warring. You’ll find
that you get the same set of results, even if the variant
you typed doesn’t occur verbatim in the text. That’s
exactly what you want, because users rarely know the precise form that
occurs in the source text or content.
3.9. Searching multiple terms
The preceding code searches for a single term only. But what
about multiple terms where you want to do a Boolean AND and show results only where all terms
are present?
To do so, you must modify the previous code to search for each
individual term, and perform an intersection of all the terms. Doing
so in sequence would be slow. To speed things up, you queue up all the
requests in parallel using a thread queue, and wait for them to finish
using a ManualResetEvent for each
request. Therefore, the total time depends only on the slowest
request, rather than being the sum of the time taken for all
requests.
To perform the intersection, you use .NET 3.5’s built-in
HashSet class, as shown in the following code.
Replace your earlier implementation of Search with the following one:
public static List<Document> Search(string queryTerms)
{
var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
["DataConnectionString"]);
var terms = queryTerms.Contains(' ')?
queryTerms.ToLower().Trim().Split(' ') : new
string[1]{queryTerms} /* Just one term */;
var resetEvents = new ManualResetEvent[terms.Length];
var allResults = new HashSet<Document>[terms.Length];
for (int i = 0; i < terms.Length; i++)
{
resetEvents[i] = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(new WaitCallback((object index) =>
{
var ctx =
new FTSDataServiceContext(account.TableEndpoint.ToString(),
account.Credentials);
var stemmer = new PorterStemmer();
var stemmedTerm = stemmer.stemTerm(terms[(int)index]);
var indexQuery = from indexEntry in
ctx.CreateQuery<IndexEntry>
(FTSDataServiceContext.IndexTableName)
where indexEntry.Term == stemmedTerm
select indexEntry;
var results = new HashSet<Document>();
foreach (IndexEntry indexEntry in indexQuery)
{
var docQuery = from doc in
ctx.CreateQuery<Document>
(FTSDataServiceContext.DocumentTableName)
where doc.ID == indexEntry.DocID
select doc;
results.Add(docQuery.FirstOrDefault());
}
allResults[(int)index] = results;
resetEvents[(int)index].Set();
}), i);
}
//Wait for all parallel queries to finish executing
WaitHandle.WaitAll(resetEvents);
// Use HashSet's intersection ability. Set it to the first term's
//results and intersect with
// results for all other terms. Though the extension method
// Intersect works with any IEnumerable,
// using HashSet gives us the best search performance
IEnumerable<Document> finalResults =
(IEnumerable<Document>)allResults[0];
foreach (var termResults in allResults)
{
finalResults = finalResults.Intersect(termResults);
}
return finalResults.ToList<Document>();
}
You can now run the project, search for multiple terms, and look
for results that contain all terms. Here’s a search query where each
term exists in multiple books, but the combination exists in only
one:
>search france dantes Marseilles
Count of Monte Cristo - Alexandre Dumas.txt
3.10. Ideas for improvement
This discussion has barely scratched the tip of the proverbial
iceberg. Information retrieval is a vast field, and there is a lot of
existing research and code on indexing, stemming, word breaking,
ranking, and various other aspects of a search.
There is also room for improvement in how you use Windows Azure
tables. For example, this examination did not include snippets of the
results and where they appear inside each book. Indexing is quite slow
in this current implementation, because you make a round trip to the
server per term, and it will need to speed up before it can be used
for interactive data entry, as opposed to a background batch
job.
Depending on your scenario, there is a lot of scope for
improvement on the ideas presented so far.